Goal/hypothesis
The goal of this pseudobulk DE analysis was to consider gene
expression differences between healthy and septic subjects in all CD66b+
cells in the dataset. To achieve this, we aggregated RNA-seq counts
across all CD66b+ cells for each subject. The only variable in the
dataset included in the DE model was “sepsis_status”. Sample_status,
sample_label, and sample_time were not considered.
The hypothesis being tested is related to a seahorse assay that shows
that in samples similar to those described above, the respiration rate
of septic subjects was decreased.
Therefore we are looking for gene expression changes in oxidative
metabolism genes.
Data
The scRNA dataset used for this analysis was generated by Jack Leary
j.leary@ufl.edu .
The dataset includes all cell types and has been integrated by
sample. Total cells were filtered for those with >0 expression of
CD66b+.
The input file is named
“MDSC_Seurat_CD66b_Only_Integrated_By_Sample.Rds”
Pseudobulk DE Analysis with DESeq2
For this analysis, RNA expression data were aggregated across all
cells per-sample. The DESeq2 formulat was exp ~ sepsis_status
The two levels of sepsis status are “Healthy” and “Septic”, and the
comparison was made in the direction “Septic relative to Healthy”
(i.e. positive fold-changes mean higher expression in septic
samples).
P-value adjustment was performed using the Benjamini-Hochberg
method.
Download DE analysis results
Download results of Pseudobulk Septic vs. Healthy DE analysis
Volcano plot
Boxplots of significantly (padj < 0.05) DE genes
---
title: "Differential Expression Analaysis (Septic vs. Healthy subjects) in Human CD66b+ cells"
author: "Heather Kates"
date: "2023-10-02"
output:
  html_document:
    theme: paper
    highlight: tango
    code_folding: show
    code_download: true
    toc: false
    toc_depth: 3
    df_print: kable
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE, comment = NA); set.seed(312)
```

```{r}
# Libraries 
## R
library(purrr)      # functional programming 
library(dplyr)      # data manipulation
library(Seurat)     # scRNA-seq tools
library(DESeq2)     # bulk DE testing
library(ggplot2)    # plots
library(biomaRt)    # gene names 
library(patchwork)  # plot alignment
```

```{r}
### Function Conflicts
select <- dplyr::select
filter <- dplyr::filter
rename <- dplyr::rename
reduce <- purrr::reduce
```

```{r}
# Helper Functions 
# connect to Ensembl database while allowing for connection timeouts
ensembl_connect <- function(dataset.use = "hsapiens_gene_ensembl", 
                            mart.use = "ensembl", 
                            n.tries = 10) {
  mart <- try(biomaRt::useDataset(dataset.use, biomaRt::useMart(mart.use)), silent = TRUE)
  i <- 1
  while (inherits(mart, "try-error") && i <= n.tries) {
    mart <- try(biomaRt::useDataset(dataset.use, biomaRt::useMart(mart.use)), silent = TRUE)
    i <- i + 1
  }
  if (inherits(mart, "try-error") && i == n.tries) {
    stop("Number of connection attempts exceeded; check Ensembl availability at www.ensembl.info")
  }
  return(mart)
}
# retrieve Ensembl ID to HGNC symbol mapping
get_gene_mapping <- function(mart.obj = NULL, ensembl.ids = NULL) {
  # check inputs
  if (is.null(mart.obj)) { stop("Please provide a Mart database.") }
  if (is.null(ensembl.ids)) { stop("Please provide a vector of Ensembl IDs.") }
  # run query & format
  gene_mapping <- data.frame(ENSEMBL_ID = ensembl.ids)
  query_res <- biomaRt::getBM(filters = "ensembl_gene_id",
                              attributes = c("ensembl_gene_id", "hgnc_symbol"), 
                              values = ensembl.ids, 
                              mart = mart.obj) %>% 
               dplyr::rename(ENSEMBL_ID = ensembl_gene_id, 
                             HGNC_SYMBOL = hgnc_symbol) %>% 
               dplyr::distinct()
  gene_mapping <- dplyr::left_join(gene_mapping, 
                                   query_res, 
                                   by = "ENSEMBL_ID") %>% 
                  dplyr::mutate(HGNC_SYMBOL = dplyr::if_else(HGNC_SYMBOL == "", NA_character_, HGNC_SYMBOL))
  return(gene_mapping)
}
# fetch Ensembl ID for a given HGNC symbol
get_ensembl <- function(gene.name = NULL, gene.table = gene_id_table) {
  # check inputs 
  if (is.null(gene.name)) { stop("Please provide an HGNC symbol to convert.") }
  # convert HGNC symbol to Ensembl ID
  ensembl_ID <- dplyr::filter(gene_id_table, HGNC_SYMBOL == gene.name) %>% 
                dplyr::pull(ENSEMBL_ID)
  if (inherits(ensembl_ID, "data.frame") && nrow(ensembl_ID) == 0) {
    stop("No matching Ensembl IDs were found in the given gene mapping table.")
  } 
  ensembl_ID <- as.character(ensembl_ID)
  if (length(ensembl_ID) == 0) {
    stop("No matching Ensembl IDs were found in the given gene mapping table.")
  }
  if (length(ensembl_ID) > 1) {
    warning("Multiple Ensembl IDs match the given HGNC symbol.")
  }
  return(ensembl_ID)
}
```

# Goal/hypothesis

The goal of this pseudobulk DE analysis was to consider gene expression differences between healthy and septic subjects in all CD66b+ cells in the dataset. To achieve this, we aggregated RNA-seq counts across all CD66b+ cells for each subject. The only variable in the dataset included in the DE model was "sepsis_status". Sample_status, sample_label, and sample_time were not considered.

The hypothesis being tested is related to a seahorse assay that shows that in samples similar to those described above, the respiration rate of septic subjects was decreased. 

Therefore we are looking for gene expression changes in oxidative metabolism genes.

# Data

The scRNA dataset used for this analysis was generated by Jack Leary <j.leary@ufl.edu>. 

The dataset includes all cell types and has been integrated by sample. Total cells were filtered for those with >0 expression of CD66b+.

The input file is named "MDSC_Seurat_CD66b_Only_Integrated_By_Sample.Rds"

```{r}
seu_int_hmy_samp_cd66b <- readRDS("MDSC_Seurat_CD66b_Only_Integrated_By_Sample.Rds")
mart <- ensembl_connect()
gene_id_table <- get_gene_mapping(mart, rownames(seu_int_hmy_samp_cd66b))
```

# Pseudobulk DE Analysis with `DESeq2`

For this analysis, RNA expression data were aggregated across all cells per-sample. The DESeq2 formulat was exp ~ sepsis_status

The two levels of sepsis status are "Healthy" and "Septic", and the comparison was made in the direction "Septic relative to Healthy" (i.e. positive fold-changes mean higher expression in septic samples).

P-value adjustment was performed using the Benjamini-Hochberg method.

```{r}
##### pseudobulk aggregation per-sample
#Aggregation
pb_list <- data.frame(AggregateExpression(seu_int_hmy_samp_cd66b, 
                                     assays = "RNA",
                                     group.by = "sample", 
                                     slot = "counts"))
#Edit pb_list colnames
colnames(pb_list) <- gsub("RNA.","",colnames(pb_list))
colnames(pb_list) <- gsub("\\.","-",colnames(pb_list))

sample_metadata <- distinct(seu_int_hmy_samp_cd66b@meta.data, 
                            sample, 
                            sample_subject, 
                            sample_label, 
                            sample_status, 
                            sample_time, 
                            sepsis_status) %>% 
               mutate(across(where(is.factor), droplevels)) %>% 
                    magrittr::set_rownames(.$sample)
sample_metadata$sample <- gsub("_","-",sample_metadata$sample)
rownames(sample_metadata) <- gsub("_","-",rownames(sample_metadata))

deseq_obj <- DESeqDataSetFromMatrix(pb_list[, sample_metadata$sample], 
             colData = sample_metadata, 
             design = ~sepsis_status)
   
deseq_de <- DESeq(deseq_obj)

deseq_res <- results(deseq_de, 
                         name = resultsNames(deseq_de)[-1], 
                         pAdjustMethod = "holm")

deseq_res_shrink <- lfcShrink(deseq_de, 
                                  coef = resultsNames(deseq_de)[-1], 
                                  type = "apeglm", 
                                  res = deseq_res)

deseq_res_df <- as.data.frame(deseq_res_shrink) %>% 
                    mutate(ensembl_id = rownames(.)) %>% 
                    left_join(gene_id_table, by = c("ensembl_id" = "ENSEMBL_ID")) %>% 
                    rename(gene = HGNC_SYMBOL) %>% 
                    relocate(ensembl_id, gene)
deseq_res_df_unshrink <- as.data.frame(deseq_res) %>% 
                    mutate(ensembl_id = rownames(.)) %>% 
                    left_join(gene_id_table, by = c("ensembl_id" = "ENSEMBL_ID")) %>% 
                    rename(gene = HGNC_SYMBOL) %>% 
                    relocate(ensembl_id, gene)
```

# Download DE analysis results

```{r}
library(downloadthis)
deseq_res_df %>% 
  download_this(
    output_name = "CD66b_Pseudobulk_Septic_vs_Healthy_DE",
    output_extension = ".xlsx",
    button_label = "Download results of Pseudobulk Septic vs. Healthy DE analysis",
    button_type = "default",
    has_icon = TRUE,
    icon = "fa fa-save"
  )
```

# Volcano plot 

```{r,fig.dim=c(6,8)}
library(ggplot2)
library(ggrepel)

# Extract pvalue, log2FoldChange, and gene name
volcano_data <- deseq_res_df_unshrink %>%
  select(ensembl_id, gene, pvalue, padj, log2FoldChange) %>%
  mutate(neg_log10_pvalue = -log10(pvalue))

library(EnhancedVolcano)

# Create volcano plot using EnhancedVolcano
p <- EnhancedVolcano(
  volcano_data,
  lab = volcano_data$gene, # Use the 'gene' column for labeling
  x = 'log2FoldChange',
  y = 'pvalue',
  title = "DE in Septic vs. Healthy",
 subtitle = "pseudobulk samples (CD66b+ cells)",

  xlab = "Log2 Fold Change", # Customize x-axis label
  ylab = "-Log10 P-value", # Customize y-axis label
  pCutoff = 0.05, # Set the p-value cutoff for significance
  FCcutoff = 1.0, # Adjust fold change cutoff as needed
  pointSize = 1.0, # Customize the point size
  labSize = 3.0, # Customize the label size
  drawConnectors = TRUE, # Draw lines connecting labels to points
  widthConnectors = 0.5, # Customize the width of connector lines
  selectLab = volcano_data %>% 
    arrange(padj) %>% 
    head(12) %>% 
    pull(gene) # Select the top 12 genes to label
)
print(p)
```

# Boxplots of significantly (padj < 0.05) DE genes

```{r,fig.dim=c(8,12)}
library(patchwork)
rlog_counts <- rlog(deseq_obj, blind = FALSE)
transformed_counts <- assay(rlog_counts)  # Use vst_counts for VST
significant_genes <- deseq_res_df %>% filter(padj < 0.05) %>% .$ensembl_id

# Assuming 'pb_list' is your pseudobulk expression matrix
df_long <- as.data.frame(t(transformed_counts[significant_genes, ]))
df_long$sample <- rownames(df_long)

# Merging with sample metadata to get sepsis status
df_long <- merge(df_long, sample_metadata, by = "sample", all.x = TRUE)

# Create list of boxplots
plot_list <- list()
for (ensembl_id in significant_genes) {
  gene_name <- deseq_res_df[deseq_res_df$ensembl_id == ensembl_id, "gene"]
  
  p <- ggplot(df_long, aes(x = sepsis_status, y = !!sym(ensembl_id), fill = sepsis_status)) +
       geom_boxplot() +
       labs(title = gene_name, x = "", y = "Normalized Expression") +
       theme_minimal() +
       scale_fill_brewer(palette = "Set2") +
       theme(legend.position = "none")
  
  plot_list[[gene_name]] <- p
}

# Determine the grid layout
num_plots <- length(plot_list)
num_cols <- 4
num_rows <- ceiling(num_plots / num_cols)

# Combine all plots into a grid
combined_plot <- wrap_plots(plot_list, ncol = num_cols) +
                 plot_annotation(title = "Boxplots of Significantly DE Genes by Sepsis Status")

# Print the combined plot
print(combined_plot)
```

# Gene expression changes (Septic vs. Healthy) in Oxidative Metabolism genes

Clay Mathews <clayton.mathews@pathology.ufl.edu> sent a list of 895 genes involved in oxidative metabolism to see if there are gene expression differences in CD66b+ cells between septic and healthy subjects. There were no significant differences in these genes, but boxplots for the ten genes with lowest p-values are shown, and the DE results for just these 895 genes can be downloaded.


```{r}
RM1_genes <- read.csv("RM1/RM1-Metabolism_Gene_List.csv")
RM1_results <- deseq_res_df %>% filter(gene %in% RM1_genes$Gene.Symbol)
deseq_res_df %>% 
  download_this(
    output_name = "CD66b_Pseudobulk_Septic_vs_Healthy_DE_OxMet_genes",
    output_extension = ".xlsx",
    button_label = "Download OxMet genes subset from results of Pseudobulk Septic vs. Healthy DE analysis",
    button_type = "default",
    has_icon = TRUE,
    icon = "fa fa-save"
  )
```

# Boxplots of top DE (not significant; top 10) oxidative metabolism genes 

```{r,fig.dim=c(8,12)}
RM1_top_genes <- RM1_results %>% arrange("pvalue",na.last=TRUE) %>% head(n=10) %>% .$ensembl_id

# Assuming 'pb_list' is your pseudobulk expression matrix
RM1_long <- as.data.frame(t(transformed_counts[RM1_top_genes, ]))
RM1_long$sample <- rownames(RM1_long)

# Merging with sample metadata to get sepsis status
RM1_long <- merge(RM1_long, sample_metadata, by = "sample", all.x = TRUE)

# Create the list of boxplots
plot_list2 <- list()

for (ensembl_id in RM1_top_genes) {
  gene_name <- RM1_results[RM1_results$ensembl_id == ensembl_id, "gene"]
  
  p <- ggplot(RM1_long, aes(x = sepsis_status, y = !!sym(ensembl_id), fill = sepsis_status)) +
       geom_boxplot() +
       labs(title = gene_name, x = "Sepsis Status", y = "Normalized Expression") +
       theme_minimal() +
       scale_fill_brewer(palette = "Set2") +
       theme(legend.position = "none")
  
  plot_list2[[gene_name]] <- p
}

# Determine the number of plots and layout
num_plots <- length(plot_list2)
num_cols <- 4
num_rows <- ceiling(num_plots / num_cols)

# Combine all plots into a grid using patchwork
combined_plot2 <- wrap_plots(plot_list2, ncol = num_cols) +
                  plot_annotation(title = "Boxplots of Top RM1 Gene Expressions by Sepsis Status")

# Print the combined plot
print(combined_plot2)
```

```{r}
# Custom function to print relevant session info
print_relevant_session_info <- function() {
  si <- sessionInfo()
  
  # R version
  cat("R version:", si$R.version$version.string, "\n")
  
  # Directly loaded packages (other attached packages)
  cat("\nDirectly loaded packages:\n")
  for (pkg in si$otherPkgs) {
    cat(pkg$Package, "version", pkg$Version, "\n")
  }
}

# Use the function
print_relevant_session_info()

```


